{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# NLP Architect - NP Semantic segmentation tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's import all the relevant classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.chdir(os.path.dirname(os.path.dirname(os.getcwd())))\n",
    "os.getcwd()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from examples.np_semantic_segmentation.data import NpSemanticSegData\n",
    "from examples.np_semantic_segmentation.preprocess_tratz2011 import *\n",
    "from examples.np_semantic_segmentation.data import *\n",
    "from nlp_architect.models.np_semantic_segmentation import NpSemanticSegClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is the download the dataset into a folder. You can download Tratz 2011 et al. dataset from the following link: [Tratz 2011 Dataset](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz). Is also available in [here](https://www.isi.edu/publications/licensed-sw/fanseparser/index.html). (The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After downloading and unzipping the dataset, the following method will labels some portion of the data, and will output two `.csv` files that will assist us to train and evaluate the trained model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_path = '<Tratz2011_dataset_local_path>'\n",
    "preprocess_tratz_2011(dataset_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the dataset is saved and labeled we need to vectories the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# labeled_data_path is the output of preprocess_tratz_2011()\n",
    "labeled_train_data_path = os.path.join(dataset_path,'tratz2011_coarse_grained_random/train.csv')\n",
    "labeled_val_data_path = os.path.join(dataset_path,'tratz2011_coarse_grained_random/val.csv')\n",
    "word2vec_path = '<local_path_to_word_embeddings>/GoogleNews-vectors-negative300.bin.gz'\n",
    "# output_path is location to save the vectors\n",
    "train_output_path = 'nlp_architect/data/np_semantic_segmentation/prepared_data_train.csv'\n",
    "val_output_path = 'nlp_architect/data/np_semantic_segmentation/prepared_data_val.csv'\n",
    "http_proxy = None\n",
    "https_proxy = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "prepare_data(labeled_train_data_path, train_output_path, word2vec_path, http_proxy, https_proxy)\n",
    "prepare_data(labeled_val_data_path, val_output_path, word2vec_path, http_proxy, https_proxy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to load the data into NpSemanticSegmentation object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = NpSemanticSegData(train_output_path, train_to_test_ratio=0.8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    model_file_path = 'np_semantic_segmentation.h5'\n",
    "    num_epochs = 200\n",
    "    model = NpSemanticSegClassifier(num_epochs=200, callback_args=None)\n",
    "    input_dim = data_set.train_set_x.shape[1]\n",
    "    model.build(input_dim)\n",
    "    model.fit(data_set.train_set)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! We now have a MLP classifier for collocations. Let's evaluate it on the val_data_set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_dataset = data_set = NpSemanticSegData(val_output_path, train_to_test_ratio=1)\n",
    "\n",
    "loss, binary_accuracy, precision, recall, f1 = model.eval(val_dataset.train_set)\n",
    "print('loss = %.1f%%' % (loss))\n",
    "print('Test binary_accuracy rate = %.1f%%' % (binary_accuracy * 100))\n",
    "print('Test precision rate = %.1f%%' % (precision * 100))\n",
    "print('Test recall rate = %.1f%%' % (recall * 100))\n",
    "print('Test f1 rate = %.1f%%' % (f1 * 100))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
